Abstract

Having a deep genetic structure evolved during its domestication and adaptation, the Asian cultivated rice (Oryza sativa) displays considerable physiological and morphological variations. Here, we describe deep whole-genome sequencing of the aus rice cultivar Kasalath by using advanced next-generation sequencing (NGS) technologies to gain a better understanding of the sequence and structural changes among highly differentiated cultivars. The de novo assembled Kasalath sequences represented 91.1% (330.55 Mb) of the genome and contained 35 139 expressed loci annotated by RNA-Seq analysis. We detected 2 787 250 single-nucleotide polymorphisms (SNPs) and 7393 large insertion/deletion (indel) sites (>100 bp) between Kasalath and Nipponbare, and 2 216 251 SNPs and 3780 large indels between Kasalath and 93-11. Extensive comparison of the gene contents among these cultivars revealed similar rates of gene gain and loss. We detected at least 7.39 Mb of inserted sequences and 40.75 Mb of unmapped sequences in the Kasalath genome in comparison with the Nipponbare reference genome. Mapping of the publicly available NGS short reads from 50 rice accessions demonstrated the necessity and value of using the Kasalath whole-genome sequence as an additional reference to capture the sequence polymorphisms that cannot be discovered by using the Nipponbare sequence alone.

1. Introduction

Over the last decade, technological developments have led to the generation of an unprecedented amount of genomic data for model organisms, providing a basis for the discovery of their genes and for understanding their genetics. The sequence of the first plant genome, from the dicot Arabidopsis thaliana, was completed and published at the end of 2000. This sequence has served as a common reference for gene annotation and comparative genomics. In particular, with the whole-genome sequence as a reference, information provided by the next-generation sequencing (NGS) technologies (new data are emerging from the 1001 Genomes Project launched in early 2008; http://1001genomes.org/) has dramatically increased the number of known genetic variants [up to several millions of single-nucleotide polymorphisms (SNPs)] in this model plant. The monocot species Asian rice (Oryza sativa L.) is one of the most important cereal crops, feeding more than half of the global population, especially in Asian countries. The International Rice Genome Sequencing Project (IRGSP) deciphered the whole genome of the subspecies japonica cultivar Nipponbare in 2005, and released a map-based high-quality sequence covering >95% of its genome. This sequence has provided a foundation for our understanding of rice genome organization, including both genes and repetitive sequences, and accelerated functional genomic studies in rice. To date, ~700 rice genes controlling various morphological and physiological traits, including resistance to biotic and abiotic stresses, have been functionally characterized. With the Nipponbare sequence as a reference, genome re-sequencing of a large number of rice accessions has led to the discovery of millions of SNPs and insertion/deletion sites (indels), enabling genome-wide association studies (GWAS) aimed at identifying agronomically important genes in rice.

To meet the challenges arising from rapid population growth and worldwide climate change, continuous efforts to increase rice production through genetic improvement technologies will be of great importance. One of the world's oldest crops (domesticated ~10 000 years ago), rice is traditionally classified into two major subspecies, indica and japonica. Owing to the deep genetic structure of rice that evolved during domestication and adaptation, and to its autogamous breeding system, current O. sativa cultivars and landraces can be subdivided in more detail into five genetically differentiated groups: indica, aus, aromatic, temperate japonica, and tropical japonica. While the reference Nipponbare sequence is particularly useful for evolutionary and functional studies, its use for extensive analysis of genome diversity remains limited because of considerable inter- and intra-species and even intra-subspecies chromosomal rearrangements, such as insertions and deletions, duplications, inversions, translocations, and transpositions. Consistent with the above observations, the proportion of uniquely mapped reads among the NGS short-read sequences from 50 cultivated and wild rice accessions against the Nipponbare reference genome varied greatly, from 73.0 to 93.0%, with the highest rate in temperate japonica accessions followed by tropical japonica, aromatic, aus, indica, and wild rice accessions. The power of GWAS for identifying rice genes depends greatly on the number and quality (high accuracy and even distribution along each rice chromosome) of SNPs, particularly when the analysis is conducted with germplasms collected within a subspecies or local populations. Moreover, the absence from the Nipponbare genome of some genes conferring tolerance to submergence or phosphorus deficiency, caused by DNA insertions or deletions, has been reported, strongly indicating that a single reference genome is insufficient for the discovery of novel genes or for comprehensive transcriptome analysis through RNA-Seq technology in rice. Because of the deep genetic structure in O. sativa, new reference sequences from additional rice cultivars are therefore needed, although chromosomal mapping and de novo assembly of the NGS reads are still challenging.

Rice cultivar Kasalath belongs to the indica subspecies or aus group of O. sativa, which has higher genome diversity than the japonica subspecies. Carrying a number of beneficial traits such as early maturity and tolerance to drought and phosphate deficiency, this cultivar, together with Nipponbare, has been particularly useful for developing a series of important genetic and genomic resources that have already contributed greatly to the molecular and functional analysis of rice chromosomes. In this study, we sequenced the whole genome of Kasalath rice by using two NGS platforms, Illumina (GAIIx and HiSeq 2000) for short reads and Roche 454 (GS FLX Titanium and GS FLX+ Titanium) for long reads. We performed de novo assembly and chromosomal mapping of the NGS read sequences. In addition, we carried out transcriptome analysis with RNA-Seq data obtained from young leaves and panicles of Kasalath by using the GAIIx for annotation of expressed sequences. Comparative analysis of the Kasalath sequence and those of other rice cultivars confirmed its value as a new reference genome to facilitate future evolutionary and functional genomic studies in rice.

2. Materials and methods

2.1. Library construction and genome re-sequencing

Total genomic DNA of Kasalath was extracted from young leaves of a single plant by using the cetyltrimethylammonium bromide method. We constructed DNA libraries with insert sizes of 800-1500 bp according to the manufacturer's standard protocols (http://www.454.com/; Basel, Switzerland) to generate long-read sequences by using Roche 454 pyrosequencing technology (GS-FLX Titanium and GS-FLX+ platforms) as described previously. We also constructed libraries with insert sizes of 250-400 bp according to the manufacturer's instructions (Illumina, San Diego, CA, USA) to produce short single or paired-end reads on the Illumina GAIIx or HiSeq 2000 platforms. To facilitate annotation of the expressed sequences, we constructed cDNA libraries with insert sizes of 350-400 bp from total RNA samples prepared from the young leaves or young panicles of Kasalath, and used these libraries to generate short-read RNA-Seq data on the Illumina GAIIx instrument as described.

2.2. Genome assembly

Raw sequence read data generated on both platforms were preprocessed to trim low-quality or adapter sequences on both ends as described previously. Sequencing errors in the Illumina data were corrected by String Graph Assembler (SGA) software v. 0.0.20 with k-mer = 55.

To construct the Kasalath pseudomolecules (Supplementary Fig. S1), we first performed de novo assembly of 454 reads into sequence contigs by using Celera Assembler v. 7.0 software with utgErrorRate = 0.015, ovlErrorRate = 0.03, cnsErrorRate = 0.05, cgwErrorRate = 0.05, utgGraphErrorRate = 0.015, utgMergeErrorRate = 0.02, and default values for other options. To improve sequence accuracy, we then mapped the error-corrected Illumina reads to the above contigs by Burrows-Wheeler Alignment (BWA) v. 0.6.2 software with the '-e 10' option. With the mapped paired-end reads, we further refined the alignments around the indel sites by using Genome Analysis Toolkit (GATK) software and discarded the putative polymerase chain reaction duplicates by using Picard software (http://picard.sourceforge.net/). Errors in each sequence contig were detected by calling variants using the SAMtools mpileup function with the '-q 20 -Q 20' options. Errors were corrected if the detected variants were homozygous with a quality score of >30, sequencing depth of >10, and frequency of >70%. This error correction procedure was performed twice to ensure sequence accuracy. After again mapping the error-corrected Illumina reads to the error-corrected 454 contigs, we finally conducted a hybrid de novo assembly by merging the error-corrected 454 contigs with the unmapped Illumina reads by using SGA with the '-m 75 -d 0.4 -g 0.1 -r 30' options.
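
The acceptance rule for correcting a contig base (homozygous variant, quality score >30, depth >10, frequency >70%) can be sketched as a simple predicate. This is only an illustration under stated assumptions: the record layout `VariantCall` and the function name `should_correct` are hypothetical, and the sketch does not reproduce how SAMtools reports these values.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """One variant call on a 454 contig (hypothetical record layout)."""
    quality: float       # variant quality score
    depth: int           # number of Illumina reads covering the site
    alt_fraction: float  # fraction of reads supporting the variant (0 to 1)
    homozygous: bool     # True when the genotype call is homozygous


def should_correct(call, min_quality=30.0, min_depth=10, min_fraction=0.70):
    """Return True when a call exceeds every threshold quoted in Section 2.2."""
    return (
        call.homozygous
        and call.quality > min_quality
        and call.depth > min_depth
        and call.alt_fraction > min_fraction
    )
```

All comparisons are strict, following the ">" signs in the text; a call with a depth of exactly 10 would not trigger a correction in this sketch.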

2.3. Generation of Kasalath pseudomolecules

All contigs of >500 bp were subjected to chromosomal mapping. First, we physically mapped their sequences to the Nipponbare reference genome (IRGSP 1.0) by using MUMmer v. 3.23 software (NUCmer) with default settings. We selected the optimal alignments by using the delta-filter command; the coordinates of each contig were displayed by using the show-coords command. All aligned contigs with values below the thresholds (90% nucleotide identity and 80% sequence coverage) were removed. If a contig was split into two or more fragments, we considered that it might correspond to genomic sites with large indels relative to the Nipponbare sequence. To determine the insertion sites, we used a fixed threshold of unaligned fragments of >100 bp with flanking sequences of >200 bp (Supplementary Fig. S2A). To determine the deletion sites, we used gapped alignments of 100-50 000 bp with flanking sequences of >200 bp (Supplementary Fig. S2B).
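
The thresholds in this paragraph can be restated as three small predicates. This is a minimal sketch, not the authors' pipeline: the function names are hypothetical, the inputs are assumed to have been extracted already from show-coords output, and the deletion range 100-50 000 bp is assumed to be inclusive at both ends.

```python
def passes_alignment_filter(identity_pct, coverage_pct,
                            min_identity=90.0, min_coverage=80.0):
    """Keep a contig alignment unless it falls below either threshold."""
    return identity_pct >= min_identity and coverage_pct >= min_coverage


def is_candidate_insertion(unaligned_len, left_flank, right_flank):
    """Unaligned fragment of >100 bp with aligned flanks of >200 bp on both sides."""
    return unaligned_len > 100 and left_flank > 200 and right_flank > 200


def is_candidate_deletion(gap_len, left_flank, right_flank):
    """Alignment gap of 100-50 000 bp with aligned flanks of >200 bp on both sides."""
    return 100 <= gap_len <= 50_000 and left_flank > 200 and right_flank > 200
```

For example, `is_candidate_insertion(350, 1200, 800)` is true, whereas the same fragment with a 150 bp flank on one side would be rejected.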

Bacterial artificial chromosome (BAC)-end sequences (BESs) from Kasalath were used to anchor the sequence contigs that could not be aligned to the Nipponbare genome by MUMmer. We mapped all Kasalath BESs (DDBJ accessions AG831174-AG909573; http://rgp.dna.affrc.go.jp/E/publicdata/kasalathendmap/index.html) onto the Kasalath contigs by the BLASTN algorithm with the 'e-value 1.0e~^' option. We selected the best positions with >90% nucleotide identity and >95% sequence coverage, and used only the uniquely aligned BESs for further analysis. We also mapped the Kasalath BESs to the Nipponbare genome sequence by selecting the best positions with >90% nucleotide identity and >90% sequence coverage, and retained only the pairs of BESs uniquely mapped at a distance of <300 kb on the Nipponbare genome. Unmapped Kasalath contigs were anchored onto the Nipponbare genome if they (i) contained uniquely aligned BESs mapped onto a Nipponbare genomic region where no Kasalath contigs had been assigned by MUMmer and (ii) had the mates of BESs aligned on a different contig already mapped on the Nipponbare sequence by MUMmer. Finally, we used the Illumina paired-end sequences to anchor the remaining Kasalath contigs to the Nipponbare sequence in the same manner as for the construction of chromosome pseudomolecules.
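
The BES selection rules above (identity and coverage thresholds, unique placement, mate distance of <300 kb) can be illustrated as follows. This is a sketch under assumptions: the tuple layout of `hits` and both function names are hypothetical, and "uniquely aligned" is interpreted here as exactly one hit passing the thresholds, which may be stricter or looser than the original best-position selection.

```python
def unique_bes_hits(hits, min_identity=90.0, min_coverage=95.0):
    """Map each BES to its contig when exactly one hit passes the thresholds.

    hits: iterable of (bes_id, contig_id, identity_pct, coverage_pct) tuples.
    """
    passing = {}
    for bes_id, contig_id, identity, coverage in hits:
        if identity > min_identity and coverage > min_coverage:
            passing.setdefault(bes_id, []).append(contig_id)
    # BESs with more than one passing hit are ambiguous and are dropped.
    return {bes: contigs[0] for bes, contigs in passing.items() if len(contigs) == 1}


def mates_within_distance(pos_a, pos_b, max_distance=300_000):
    """True when two mate positions on the reference are <300 kb apart."""
    return abs(pos_a - pos_b) < max_distance
```

The default coverage of 95% corresponds to mapping onto Kasalath contigs; for mapping onto the Nipponbare genome the text quotes 90%, which can be passed as `min_coverage=90.0`.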

2.4. Transcriptome analysis

The RNA-Seq reads of Kasalath were mapped onto its pseudomolecules by TopHat v. 2.0.8b software with the '--min-intron-length 67 --max-intron-length 3608' options. The thresholds for the intron length corresponded to the 1st and 99th percentiles of the distribution of intron length, as retrieved from the annotations in the Rice Annotation Project (RAP) database. In addition, we set the '-G' option on the basis of the intron/exon structures in the pseudomolecules converted from the Nipponbare genome annotated by the RAP. Gene structures predicted by Cufflinks v. 2.1.1 software individually for the young leaves and young panicles were merged by Cuffmerge software. DNA sequences of predicted transcripts were mapped onto the Nipponbare genome or proteome sequences by the BLASTN algorithm and est2genome tool with thresholds of >90% nucleotide identity and >70% sequence coverage.
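
Deriving intron-length bounds from the 1st and 99th percentiles of an annotated intron-length distribution can be sketched with the Python standard library. The function name `intron_length_bounds` is hypothetical, and the percentile interpolation used by `statistics.quantiles` (its default "exclusive" method) is an assumption; the source does not state which percentile definition was used, so results may differ slightly from the reported 67 and 3608.

```python
import statistics


def intron_length_bounds(intron_lengths):
    """Return (1st percentile, 99th percentile) of intron lengths, rounded to bp."""
    # quantiles(..., n=100) returns the 99 cut points between percentiles;
    # index 0 is the 1st percentile and index 98 is the 99th percentile.
    cuts = statistics.quantiles(intron_lengths, n=100)
    return round(cuts[0]), round(cuts[98])


def tophat_intron_options(intron_lengths):
    """Format the bounds as TopHat intron-length options."""
    low, high = intron_length_bounds(intron_lengths)
    return f"--min-intron-length {low} --max-intron-length {high}"
```

The input would be the list of intron lengths taken from a gene annotation; parsing the annotation file itself is outside this sketch.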

2.5. Detection of SNPs and indels among rice cultivars

Pseudomolecule sequences were compared among japonica rice Nipponbare (IRGSP 1.0, http://rapdb.dna.affrc.go.jp/), indica rice 93-11 (http://rise2.genomics.org.cn/page/rice/index.jsp), and aus rice Kasalath by using the MUMmer program to detect SNPs and indels. To ensure that large indels (>100 bp) between any two cultivars were not due to misassembled contigs, we mapped all Illumina reads of the Kasalath genome to its pseudomolecule sequences by using BWA to confirm that the boundaries of the insertions (Supplementary Fig. S2A) and deletions (Supplementary Fig. S2B) were covered by at least five overlapping reads stepped over by their paired sequences. Each SNP and indel was annotated by SnpEff (http://snpeff.sourceforge.net/index.html) to predict the effects of variants on genes.
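
The boundary check for large indels can be approximated by counting mapped reads that span a boundary coordinate. This is a simplified sketch: the function name and the interval representation are hypothetical, and it ignores the read-pairing condition ("stepped over by their paired sequences") that the actual validation also required.

```python
def boundary_supported(boundary_pos, read_intervals, min_reads=5):
    """Return True when at least `min_reads` mapped reads span the boundary.

    read_intervals: iterable of (start, end) coordinates, 0-based half-open,
    for reads mapped to the pseudomolecule carrying the indel.
    """
    spanning = sum(1 for start, end in read_intervals if start < boundary_pos < end)
    return spanning >= min_reads
```

A read that merely ends or starts at the boundary does not count as spanning it in this sketch, because such a read gives no evidence that the two sides are joined.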

2.6. Chromosomal mapping of the publicly available short-read sequences by using multiple rice pseudomolecules

Publicly available sequence data generated by the Illumina GAII instruments from 50 accessions of cultivated and wild rice at ~15x coverage were downloaded from the NCBI Short Read Archive (accession number SRA023116). By using BWA, we aligned these sequences to the pseudomolecule sequences of Nipponbare, Kasalath, and 93-11 to examine the efficiency of chromosomal mapping. By using TASUKE, a web-based application developed recently for visualization of large-scale re-sequencing data, we constructed a genome viewer to display the sequences of and the structural variations among the above rice accessions with reference to the genomic sequence of Kasalath instead of Nipponbare.



3. Results and discussion 

3.1. Kasalath pseudomolecules constructed from 330.55 Mb of sequences

By performing de novo assembly of 2.49 Gb of sequences (>6x coverage) generated by Roche 454 (1.73 Gb from GS-FLX Titanium with an average read length of 386.1 bp, and 0.76 Gb from GS-FLX+ with an average read length of 593.8 bp) with Celera Assembler, we created 109 362 contigs containing 296.3 Mb with an N50 length (minimum length of contigs representing 50% of the assembly) of 3.2 kb. To increase coverage and accuracy of genomic sequences, we additionally generated a total of 57.47 Gb of Kasalath sequences (>148x coverage) by using Illumina GAIIx and HiSeq 2000. On the basis of trimmed and error-corrected Illumina sequences, we corrected the sequencing errors within all contigs initially assembled from the Roche 454 reads. Finally, we conducted the hybrid de novo assembly by using all of the above sequence data from both Illumina and Roche 454, which produced 51 550 contigs containing a total of 330.55 Mb of non-overlapping sequences, corresponding to 88.6% of the published Nipponbare sequence (373.25 Mb) (Table 1). Approximately 72% (36 932) of these Kasalath contigs, corresponding to 87.7% (289.80 Mb) of the total assembled sequences, were successfully anchored to the 12 chromosomes (Supplementary Table S1), covering 292.49 Mb of the Nipponbare reference genome. The total length (35 914 803 bp) of Kasalath contigs anchored on chromosome 1 by the hybrid de novo assembly was longer than that (32 835 386 bp) achieved only by single-reference (Nipponbare) mapping using BWA (details not shown). By using the MUMmer alignment software, we mapped the above two sequences to a BAC-based genomic sequence (41.37 Mb) of Kasalath chromosome 1 (>99% nucleotide identity), which revealed chromosome coverage of 82.3 and 73.0%, respectively. Therefore, our assembly of NGS reads can link genomic sequences to Kasalath chromosomal regions, which is not possible by using reference mapping only. Such chromosomal regions could be cultivar-specific; these differences may be caused by insertions or deletions of DNA segments, and in some cases they might be of great importance for the maintenance and use of rice genetic resources. For example, recent cloning and functional analysis of a major quantitative trait locus for phosphorus-deficiency tolerance (Pup1) in rice placed this gene within a chromosomal region of ~90 kb on Kasalath chromosome 11, which was absent in the Nipponbare genome. In our present study, a total of 18 contigs (60.25 kb) were successfully assembled from the above genomic region of Kasalath, which fully covered the Pup1-specific protein kinase gene (PSTOL1).

Table 1. Statistics of de novo assembly and chromosomal mapping of Kasalath NGS reads to Nipponbare pseudomolecules

            No. of contigs    N50 (bp)    Maximum L (bp)    Mean L (bp)    Total L (bp)
Mapped      36 936            13 728      103 131           7847           289 796 664
Unmapped    14 822            3615        43 777            2749           40 748 813

L, length; N50, minimum length of contigs representing 50% of the assembly.
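
The N50 statistic reported in Table 1 and in the text (minimum length of contigs representing 50% of the assembly) can be computed from a list of contig lengths as sketched below. The function name `n50` is illustrative; this is the conventional definition, not code taken from the assemblers used in the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are taken from longest to shortest until they account for at
    least half of the total assembled length; the length of the last
    contig taken is the N50.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")
```

For the toy assembly `[2, 2, 2, 3, 3, 4, 8, 8]` (total 32), the two 8-unit contigs already cover half of the total, so the N50 is 8.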



Table 2. Statistics of SNPs and indels detected between Kasalath, Nipponbare, and 93-11 genomes

                Kasalath-Nipponbare          Kasalath-93-11
Chromosome      SNPs         Large indels    SNPs         Large indels
chr01           318 375      921             252 557      506
chr02           253 524      651             184 485      383
chr03           278 482      749             204 712      405
chr04           233 426      617             212 291      329
chr05           244 149      616             179 973      294
chr06           241 274      648             210 155      373
chr07           231 566      605             155 910      300
chr08           214 710      535             166 666      255
chr09           179 050      434             141 896      232
chr10           193 491      531             149 212      241
chr11           211 931      533             185 340      195
chr12           187 272      553             173 054      267
Total           2 787 250    7393            2 216 251    3780

3.2. Genome-wide diversity among Kasalath, Nipponbare, and 93-11

Sequence alignment by MUMmer revealed SNPs at 2 787 250 nucleotide sites between Kasalath and Nipponbare (alignment length of 278.75 Mb) and 2 216 251 nucleotide sites between Kasalath and 93-11 (alignment length of 259.00 Mb) (Table 2). Thus, the SNP frequency was 1.00% between aus and japonica and 0.86% between aus and indica, consistent with the genetic structure of O. sativa reported so far. Kasalath and 93-11 shared 1 378 591 common SNPs in comparison with the Nipponbare genome, which provides useful genomic resources for future studies of domestication and subspeciation of Asian cultivated rice. The SNPs present only between Kasalath and 93-11, two closely related cultivars, offer great potential for the discovery of naturally occurring mutations that might be associated with recent phenotypic changes that appeared during local adaptation after the divergence of japonica and indica.
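
The SNP frequencies quoted in this paragraph follow directly from the SNP counts and alignment lengths. The short check below reproduces them; the function name is illustrative, and all numbers are taken from the text.

```python
def snp_frequency_percent(snp_count, aligned_bp):
    """SNP sites per aligned base, expressed as a percentage."""
    return 100.0 * snp_count / aligned_bp


# Kasalath vs Nipponbare (aus vs japonica): 2 787 250 SNPs over 278.75 Mb
kasalath_nipponbare = snp_frequency_percent(2_787_250, 278.75e6)

# Kasalath vs 93-11 (aus vs indica): 2 216 251 SNPs over 259.00 Mb
kasalath_9311 = snp_frequency_percent(2_216_251, 259.00e6)

print(f"{kasalath_nipponbare:.2f}% {kasalath_9311:.2f}%")
```

Rounded to two decimals, the two values are 1.00% and 0.86%, matching the figures in the text.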

This genomic information should help to explain in depth the molecular mechanisms underlying not only the evolution, but also the functions of rice genomes. A total of 37 869 genes have been annotated in the Nipponbare genome. We found that most of the SNPs resided in non-genic regions. Only 5.1% (142 366) of the total SNPs detected between Nipponbare and Kasalath were located within protein-coding regions (Fig. 1). SNPs creating premature stop codons (nonsense mutations) or altering splice-site motifs can be expected to have harmful effects on gene and protein function and eventually to cause loss of function. We examined SNP presence and locations within 26 132 genes with sequences fully aligned between the Nipponbare and Kasalath genomes, and discovered that 902 genes had premature stop codons or altered splice-site motifs; of these, 33 seemed to have been pseudogenized in Kasalath. To compare the expression of the genes with and without harmful SNPs in the two cultivars, we carried out whole-transcriptome analysis in Nipponbare and Kasalath by using the RNA-Seq data for young leaves and young panicles. The fraction of genes specifically expressed in Nipponbare was significantly higher among the genes carrying harmful SNPs than among the genes without such SNPs (P < 10~^), suggesting that genes with harmful mutations are subject to pseudogenization.

Figure 1. Distribution pattern of SNPs detected between the genomes of Nipponbare and Kasalath cultivars.

Furthermore, we detected large indels at 7393 genomic sites between Kasalath and Nipponbare (100-38 041 bp; average length 1999 bp) and at 3780 genomic sites between Kasalath and 93-11 (100-15 333 bp; average length 735 bp) (Table 2), corresponding to large-indel frequencies of 0.003% (aus-japonica) and 0.001% (aus-indica). The total amount of indel nucleotides (completed sequences) in Kasalath relative to Nipponbare was 14.78 Mb (5026 deletions, 13.49 Mb; 2367 insertions, 1.29 Mb); far fewer indel sites were observed in Kasalath relative to 93-11 (2244 deletions, 1.84 Mb; 1536 insertions, 0.94 Mb). We detected many more chromosomal sites for deletions than for insertions, probably owing to inefficiencies of de novo assembly of NGS short reads and chromosomal mapping of assembled contigs, especially for the genomic regions carrying recently duplicated segments or highly repetitive sequences. When we took into account the insertions containing partially assembled sequences, the total length of inserted sequences in Kasalath increased to 7.39 Mb, which, however, was still much less than that of the deleted sequences (13.49 Mb). These findings imply that the genome of Kasalath (estimated size of 363 Mb) is slightly smaller than that of Nipponbare (384-387 Mb). The distribution pattern of deletion sizes in Kasalath against Nipponbare displayed two peaks if deletions of <1 kb were ignored (Fig. 2 and Supplementary Fig. S2). The first and largest peak appeared at 4 kb (3-5 kb), in which 58.5% of nucleotides were from repetitive sequences. The second peak was at 12-13 kb (11-14 kb), in which up to 62.7% of the nucleotides were from repetitive sequences. These data reveal the involvement of transposable elements, particularly those from the long-terminal-repeat retrotransposon families.
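
The indel totals, average lengths, and frequencies in this paragraph are internally consistent, as the following check shows. The counts and lengths come from the text; using the MUMmer alignment lengths from Section 3.2 (278.75 Mb and 259.00 Mb) as denominators for the large-indel frequencies is an assumption, since the source does not state the denominator explicitly.

```python
def percent(part, whole):
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


# Kasalath relative to Nipponbare
kn_sites = 5026 + 2367                      # deletions + insertions
kn_total_mb = 13.49 + 1.29                  # deleted + inserted sequence (Mb)
kn_mean_bp = kn_total_mb * 1e6 / kn_sites   # average indel length (bp)
kn_freq = percent(kn_sites, 278.75e6)       # indel sites per aligned base (%)

# Kasalath relative to 93-11
k9_sites = 2244 + 1536
k9_total_mb = 1.84 + 0.94
k9_mean_bp = k9_total_mb * 1e6 / k9_sites
k9_freq = percent(k9_sites, 259.00e6)

print(kn_sites, round(kn_total_mb, 2), round(kn_mean_bp), round(kn_freq, 3))
print(k9_sites, round(k9_total_mb, 2), round(k9_mean_bp), round(k9_freq, 3))
```

Under that assumption the rounded results (7393 sites, 14.78 Mb, 1999 bp, 0.003%; 3780 sites, 2.78 Mb, 735 bp, 0.001%) agree with the values reported above.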

3.3. Gain and loss of genes in Kasalath, Nipponbare, and 93-11

About 72.0% of the Nipponbare chromosomal sites (67.6 Mb) not covered by Kasalath pseudomolecules were masked as repetitive sequences. Up to 89.0% of the transcript sequences of Nipponbare were rescued by the assembled contigs of Kasalath (Fig. 3). This result indicates that most of the genic regions in Kasalath were captured through our re-sequencing and genome assembly. However, we still found that 6.3% (2828) of the total transcripts (44 536, including alternative variants) annotated in Nipponbare were likely absent in Kasalath (exon coverage <5%). To clarify whether these missing transcripts represented real changes of gene content between these cultivars, we examined the gene coverage by aligning the 93-11 genome (indica) to the Nipponbare reference genome. Interestingly, a similar number of the Nipponbare transcripts (2904) were likely missing in the 93-11 genome, of which 1278 were also absent in Kasalath (aus). These results clearly indicate that at least 3.1% of the genes in the japonica cultivar Nipponbare (1174 of 37 869 genes, excluding alternative variants) lack orthologs in indica and aus cultivars, mainly because of insertions or deletions. The frequency of the Nipponbare genes absent in Kasalath or 93-11 seemed to vary among the 12 chromosomes; chromosome 3 had the lowest value of 7.6 genes absent per Mb (Supplementary Fig. S3). As expected, an extremely high frequency (21.6 genes absent per Mb) was observed on chromosomes 11 and 12, which are characterized by recent generation of gene copies by tandem gene amplification and segmental duplication in the Nipponbare genome. Since these two chromosomes are known to carry the genes for agronomically important traits (such as resistance to blast, bacterial blight, viruses, and insects; photoperiod-sensitive male sterility; and salt tolerance), our comparison of the genomic sequences of different rice cultivars should provide fundamental information useful for our understanding of the evolution and function of these genes to the benefit of future molecular breeding programmes.

Figure 2. Size distribution and sequence classification of large deletions in Kasalath in comparison with Nipponbare.

Figure 3. Nipponbare transcripts covered by Kasalath pseudomolecule sequences. The horizontal axis represents the sequence coverage (x100%) of each gene annotated on Nipponbare pseudomolecules.
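
The percentages of missing transcripts and genes quoted in Section 3.3 can be re-derived from the counts given in the text. The helper name `share_percent` is illustrative; no data beyond the published counts are used.

```python
def share_percent(count, total):
    """Share of `count` in `total`, as a percentage."""
    return 100.0 * count / total


# Nipponbare transcripts likely absent in Kasalath (including alternative variants)
absent_in_kasalath = share_percent(2828, 44_536)

# Nipponbare transcripts likely absent in 93-11 (count from the text)
absent_in_9311 = share_percent(2904, 44_536)

# Nipponbare genes lacking orthologs in both cultivars (excluding alternative variants)
absent_in_both = share_percent(1174, 37_869)

print(f"{absent_in_kasalath:.1f}% {absent_in_9311:.1f}% {absent_in_both:.1f}%")
```

The first and third values round to 6.3% and 3.1%, as reported; the second (about 6.5%) is not stated in the text and is shown only for comparison.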

We obtained 2.3 Gb of RNA-Seq reads from young leaves of Kasalath and 2.9 Gb from young panicles. This enabled us to perform whole-transcriptome analysis of the aus rice genome. By mapping these two datasets to Kasalath pseudomolecule sequences (all assembled sequences), we annotated 55 188 transcripts comprising 35 139 loci (Supplementary Fig. S4). To estimate the gain of genes in aus rice in comparison with japonica rice, we aligned all Kasalath transcript sequences to the Nipponbare pseudomolecules or protein sequences from the proteome database. A total of 2664 transcripts remained unmapped (<90% nucleotide identity and <70% sequence coverage); of these, 1226 were unique to Kasalath (<50% sequence coverage). Of the 1226 transcripts, the translated sequences of 789 matched 535 known proteins. Analysis of the functional protein domains encoded by these transcripts revealed that protein kinases and disease resistance-related proteins were over-represented (Table 3), supporting the previous results of comparative genome analysis of Asian cultivated rice.

3.4. Chromosomal mapping of publicly available NGS short reads from 50 rice accessions to multiple reference sequences

The map-based, high-quality sequence of Nipponbare has typically been used as a reference for not only comparative, but also functional genomics. In the present study, comparative analysis of genomic sequences of Kasalath, Nipponbare, and 93-11 led to the discovery of cultivar-specific sequences; some were associated with genes of agronomic importance such as Pup1 in the Kasalath genome. About 7.39 Mb of inserted sequences were detected in Kasalath relative to Nipponbare, and 40.75 Mb of Kasalath sequences still remained unmapped to its chromosomes. This result emphasizes the necessity and importance of using pseudomolecule sequences as additional references for comparative genomic studies in rice to understand comprehensively its genome diversity, particularly among the cultivars of the indica subspecies and aus-type cultivars. We mapped the publicly available Illumina short reads derived from 50 diverse landraces and wild rice accessions to the pseudomolecule sequences of Kasalath, Nipponbare, and 93-11 (Supplementary Table S2). The mapping rate of unique reads (uniquely mapped reads/total reads x 100%) varied widely between the accessions, from 62.3 to 84.8% (Fig. 4). As expected, more sequence reads from the aus and indica varieties were mapped to the Kasalath and 93-11 genomes than to the Nipponbare genome, except for one tropical japonica accession (IRGC43397), which might have been previously misgrouped by phylogenetic analysis, or whose genomic DNA used for genotyping and sequencing might have been mislabelled. On the other hand, the mapping rates were low for all accessions when the 93-11 pseudomolecule sequence was used as a reference. This result indicates certain limitations in using the current 93-11 sequence for extensive comparative genomic studies in rice, probably because of its lower accuracy or poorer quality of sequence assembly than those of the Nipponbare and Kasalath pseudomolecules. A recent study has been performed to improve sequence quality and chromosome coverage by re-sequencing the 93-11 genome up to 36-fold depth. Detailed data on the gene annotation and the sequence and structural variations among the 50 rice accessions obtained in the present study by using the Kasalath pseudomolecules as a reference are accessible through our genome viewer (http://rice50ks.dna.affrc.go.jp/) developed on the basis of the TASUKE program (Supplementary Fig. S4).

Table 3. Top 10 over-represented functional domains in the genes found in Kasalath but not in Nipponbare

IPR000719    Protein kinase, catalytic domain
IPR000767    Disease resistance protein
IPR001245    Serine-threonine/tyrosine-protein kinase catalytic domain
IPR001611    Leucine-rich repeat
IPR002182    NB-ARC
IPR008271    Serine/threonine-protein kinase, active site
IPR011009    Protein kinase-like domain
IPR013210    Leucine-rich repeat-containing N-terminal, type 2
IPR013320    Concanavalin A-like lectin/glucanase, subgroup
IPR017441    Protein kinase, ATP-binding site

Figure 4. Rate of uniquely mapped NGS reads from 50 rice accessions by using Nipponbare, Kasalath, and 93-11 pseudomolecule sequences as references. Arabic numerals under the horizontal axis represent different accessions of cultivated and wild rice (see Supplementary Table S2 for details).
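
The mapping rate of unique reads defined in Section 3.4 (uniquely mapped reads/total reads x 100%) and the comparison across references can be sketched as below. The function names are illustrative, and the read counts in the usage note are hypothetical placeholders, not values from the study.

```python
def unique_mapping_rate(uniquely_mapped, total_reads):
    """Uniquely mapped reads as a percentage of all reads of an accession."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * uniquely_mapped / total_reads


def best_reference(rates_by_reference):
    """Name of the reference with the highest unique mapping rate."""
    return max(rates_by_reference, key=rates_by_reference.get)
```

For a hypothetical accession with 1 000 000 reads of which 848 000, 790 000, and 650 000 map uniquely to Kasalath, Nipponbare, and 93-11, the rates would be 84.8%, 79.0%, and 65.0%, and `best_reference` would return "Kasalath".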

3.5. Conclusions 

In this study, we performed deep sequencing (>154-fold coverage) by using NGS technologies and de novo assembly of the whole genome of the aus rice cultivar 'Kasalath'. The assembled sequences cover 91.1% of the whole genome and 89.0% of the transcribed regions annotated on the basis of the reference Nipponbare genome. Besides millions of SNPs, comparative genomics revealed genome-wide sequence and structural variations, including thousands of large indels associated with the gain or loss of genes, between japonica, indica, and aus-type rice cultivars. Chromosomal mapping of the publicly available NGS reads from 50 rice accessions to Kasalath pseudomolecules demonstrated that its genomic sequence should be extremely useful as a new reference for future comparative genomic studies, particularly for capturing the sequence polymorphisms that could not be obtained by using the Nipponbare pseudomolecule sequences alone.
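
The coverage percentages reported for the assembly can be cross-checked from the sizes given in Section 3. Pairing 330.55 Mb with the estimated Kasalath genome size of 363 Mb to obtain 91.1% is an inference from the numbers rather than something the text states outright; the other two ratios are spelled out in Section 3.1.

```python
ASSEMBLED_MB = 330.55            # total hybrid de novo assembly
ANCHORED_MB = 289.80             # contigs anchored to the 12 chromosomes
KASALATH_GENOME_MB = 363.0       # estimated Kasalath genome size
NIPPONBARE_SEQUENCE_MB = 373.25  # published Nipponbare sequence


def percent_of(part_mb, whole_mb):
    """Express one sequence length as a percentage of another."""
    return 100.0 * part_mb / whole_mb


genome_coverage = percent_of(ASSEMBLED_MB, KASALATH_GENOME_MB)
relative_to_nipponbare = percent_of(ASSEMBLED_MB, NIPPONBARE_SEQUENCE_MB)
anchored_share = percent_of(ANCHORED_MB, ASSEMBLED_MB)

print(f"{genome_coverage:.1f}% {relative_to_nipponbare:.1f}% {anchored_share:.1f}%")
```

Rounded to one decimal, these give 91.1%, 88.6%, and 87.7%, in line with the reported values.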



Accession numbers 

The genomic and RNA-Seq sequences of Kasalath rice reported in this paper have been deposited in the DDBJ database with accession numbers DRA000968 and DRA001099.